 rgb stream


VPN++: Rethinking Video-Pose embeddings for understanding Activities of Daily Living

arXiv.org Artificial Intelligence

Abstract: Many attempts have been made to combine RGB and 3D poses for the recognition of Activities of Daily Living (ADL). ADL may look very similar, and distinguishing them often requires modeling fine-grained details. Because recent 3D ConvNets are too rigid to capture the subtle visual patterns across an action, this research direction is dominated by methods combining RGB and 3D poses. But the cost of computing 3D poses from an RGB stream is high in the absence of appropriate sensors, which limits the use of these approaches in real-world applications requiring low latency. How, then, can we best take advantage of 3D poses for recognizing ADL? To this end, we propose an extension of a pose-driven attention mechanism, the Video-Pose Network (VPN), exploring two distinct directions. One transfers pose knowledge into RGB through feature-level distillation; the other mimics pose-driven attention through attention-level distillation. Finally, these two approaches are integrated into a single model, which we call VPN++. We show that VPN++ is not only effective but also provides a large speed-up and high resilience to noisy poses. VPN++, with or without 3D poses, outperforms the representative baselines on 4 public datasets.
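As a rough illustration of the two distillation directions named in the abstract, the sketch below combines a feature-level term and an attention-level term into one training loss. The abstract does not state the form of either loss, so the mean-squared-error choice, the function names, and the `alpha`/`beta` weights are all assumptions made for illustration, not the authors' formulation.

```python
# Hypothetical sketch: an RGB "student" is pushed to match a pose "teacher"
# both in feature space and in attention space. MSE is an assumed choice.

def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(rgb_feat, pose_feat, rgb_attn, pose_attn,
                      alpha=1.0, beta=1.0):
    """Weighted sum of a feature-level and an attention-level matching term.

    rgb_feat / pose_feat: feature vectors from the RGB and pose branches.
    rgb_attn / pose_attn: attention weights predicted from RGB alone and
    from poses. alpha and beta are illustrative weighting factors.
    """
    feature_term = mse(rgb_feat, pose_feat)      # feature-level distillation
    attention_term = mse(rgb_attn, pose_attn)    # attention-level distillation
    return alpha * feature_term + beta * attention_term


if __name__ == "__main__":
    loss = distillation_loss(
        rgb_feat=[0.0, 0.0], pose_feat=[2.0, 0.0],
        rgb_attn=[0.5, 0.5], pose_attn=[0.5, 0.5],
    )
    print(loss)
```

In a setup like this, the pose branch would only be needed during training; at inference the RGB branch runs alone, which is consistent with the abstract's claim that the model can work without 3D poses.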


MARS: Motion-Augmented RGB Stream for Action Recognition - Naver Labs Europe

#artificialintelligence

This blog post presents our CVPR'19 paper, "MARS: Motion-Augmented RGB Stream for Action Recognition", done with the Thoth team at Inria. The code and trained models are available here. Action recognition in videos requires processing both spatial and temporal information. Although CNNs have been quite successful at modeling spatial information, their performance at modeling temporal information has been subpar. Current state-of-the-art techniques use 3D-CNN-based two-stream architectures trained on a large dataset, where one stream processes appearance information from RGB frames while the other handles motion information from optical flow. However, computing optical flow adds latency, which limits the use of these techniques in real-time applications.
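To make the two-stream baseline described above concrete, here is a minimal sketch of late fusion: each stream produces class scores, and the two sets of scores are averaged. The function names and the `flow_weight` parameter are assumptions for illustration; this shows the conventional two-stream setup that the post takes as its starting point, not the MARS method itself.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(rgb_logits, flow_logits, flow_weight=0.5):
    """Weighted average of per-class probabilities from the two streams.

    flow_weight=0.0 uses the RGB (appearance) stream only, which is the
    case where no optical flow has to be computed at test time.
    """
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [(1.0 - flow_weight) * r + flow_weight * f
            for r, f in zip(rgb_probs, flow_probs)]


def predict_class(rgb_logits, flow_logits, flow_weight=0.5):
    """Index of the class with the highest fused probability."""
    fused = fuse_two_stream(rgb_logits, flow_logits, flow_weight)
    return max(range(len(fused)), key=fused.__getitem__)


if __name__ == "__main__":
    # The appearance stream favours class 0, the motion stream class 2.
    print(predict_class([2.0, 0.0, 0.0], [0.0, 0.0, 3.0]))
```

The flow stream's scores are only available after optical flow has been computed for the clip, which is the latency cost the post refers to.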


Bilinear Faster RCNN with ELA for Image Tampering Detection

arXiv.org Machine Learning

With technological advances leading to an increase in mechanisms of image tampering, our fraud detection methods must continue to be upgraded to match their sophistication. One problem with current methods is that they require prior knowledge of the method of forgery in order to determine which features to extract from the image to localize the region of interest. When a machine learning algorithm is used to learn different types of tampering from a large set of various image types, a big enough database lets us easily classify which images are tampered (by training on the entire image feature map for each image), but we are still left with the question of which features to train on and how to localize the manipulation. To solve this, object detection networks such as Faster RCNN, which combine an RPN (Region Proposal Network) with a CNN, have recently been adapted to fraud detection by using their ability to propose bounding boxes for objects of interest to localize the tampering artifacts. In this work, an existing bilinear Faster RCNN model is modified so that its second stream takes the ELA (Error Level Analysis) JPEG compression level mask as input.